ABSTRACT

Recent developments in structural equation modeling have produced several methods that can usually distinguish cause from effect in the two-variable case. For that purpose, however, one has to impose substantial structural constraints or smoothness assumptions on the functional causal models. In this paper, we consider the problem of determining the causal direction from a related but different point of view, and propose a new framework for causal direction determination. We show that it is possible to perform causal inference based on the condition that the cause is “exogenous” for the parameters involved in the generating process from the cause to the effect. In this way, we avoid the structural constraints required by the SEM-based approaches. In particular, we exploit nonparametric methods to estimate marginal and conditional distributions, and propose a bootstrap-based approach to test for the exogeneity condition; the testing results indicate the causal direction between two variables. The proposed method is validated on both synthetic and real data.

Categories and Subject Descriptors 

H.4 [Information Systems Applications]: Miscellaneous;

I.2.4 [Artificial Intelligence]: Knowledge Representation Formalisms and Methods - Miscellaneous

General Terms 

Algorithms, Theory 

Keywords 

Causal discovery, causal direction, exogeneity, statistical independence, bootstrap

1. INTRODUCTION 

Understanding causal relations allows us to predict the effect of changes in a system and control the behavior of the system. Since randomized experiments are usually expensive and often impracticable, causal discovery from non-experimental data has attracted much interest [18, 23]. To do this, it is crucial to find (statistical) properties in the non-experimental data that give clues about causal relations. For instance, under the causal Markov condition and faithfulness assumption, the causal structure can be partially estimated by constraint-based methods, which make use of conditional independence relationships.

Here we are concerned with the two-variable case, in which constraint-based methods, such as the PC algorithm [23], do not apply. We assume that the given observations are i.i.d., i.e., there is no temporal information. Recently, causal discovery based on structural equation models (SEMs) has proved useful in distinguishing cause from effect [21, 27, 26, 15]; however, the performance of such approaches depends on assumptions on the functional model class and/or on the data-generating functions. On the other hand, there have been attempts in different fields to characterize properties related to causal systems. One such concept (or family of concepts) is known as exogeneity, which is salient in econometrics [?]. Roughly speaking, the notion expresses the property that the process that determines one variable X is in some sense separate from or independent of the process that determines another variable, say Y, given the value of X.

The sense of “separateness” or “independence” in the rough idea has been specified in several ways for different purposes, which result in different concepts of exogeneity. The concept that is most relevant in this paper is the one in the context of model reduction, which was originally proposed as a condition that justifies inferences about the parameters of interest based on the conditional likelihood function rather than the joint likelihood function [?]. Here is the basic idea. Suppose the joint distribution of (X, Y) can be factorized as

\[ p(X, Y \mid \theta, \psi) = p(Y \mid X, \psi)\, p(X \mid \theta), \tag{1} \]

where the conditional distribution p(Y|X) is parameterized by ψ alone, and the marginal distribution p(X) by θ alone. According to [?], X is said to be exogenous for ψ (or any parameter of interest that is a function of ψ), if ψ and θ are variation free,^1 or in other words, are not subject to “cross-restrictions”. From the frequentist point of view, this implies that ψ and θ are independently estimable: the MLE of ψ and that of θ are statistically independent according to the sampling distribution. From the Bayesian point of view [?], this implies that ψ and θ are a posteriori independent given independent priors on them.

^1 This is actually the definition of “weak exogeneity” in [?], where three types of exogeneity were defined. Here we consider the i.i.d. case where there is no temporal information.

In this paper we will exploit the above idea to develop a test of whether there exists a parameterization (θ, ψ) for p(X, Y) such that X is exogenous for ψ, the parameters for p(Y|X). The test is based on bootstrap and is applicable in nonparametric settings. We will also argue that if X is a cause of Y and there is no confounding, then there should exist a parameterization such that X is exogenous for the parameters for p(Y|X). Thus the nonparametric test can be used to indicate the causal direction between two variables, when the test passes for one direction but fails for the other. Compared to the SEM-based approach, an important novelty of this work is to use exogeneity as a new criterion for causal discovery in general settings, which allows distinguishing cause from effect and detecting confounders without structural constraints on the causal mechanism.^2

2. EXOGENEITY AND CAUSALITY 

In this section we define what “exogeneity” means in this paper, and explain its link to causal asymmetry. The concept of exogeneity we will use is adapted from the concept known in econometrics as weak exogeneity, which is in itself a statistical rather than a causal concept.^3 We will show that this statistical notion can nonetheless be exploited to formulate a method that can often determine the causal direction between two variables.

2.1 Exogeneity 

The concept of weak exogeneity, as formulated by Engle, Hendry, and Richard (EHR) [?], is concerned with when efficient estimation of a set of parameters of interest can be made in a conditional submodel. For the purpose of this paper, suppose we are given two continuous random variables X and Y, on which we have i.i.d. observations that are drawn according to a joint density p(X, Y | φ). By a reparameterization we mean a one-to-one transformation of the parameter set φ. Our definition below is adapted from the EHR definition, adjusted for our present purpose and setup:


Definition 1 (Exogeneity of X for p(Y|X)). Suppose p(X, Y) is parameterized by φ. X is said to be exogenous for the conditional p(Y|X) (or simply, exogenous relative to Y) if and only if there exists a reparameterization φ → (θ, ψ), such that

(i.) p(X, Y | θ, ψ) = p(Y | X, ψ) p(X | θ), and

(ii.) θ and ψ are variation free, i.e., (θ, ψ) ∈ Θ × Ψ, where Θ and Ψ denote the set of admissible values of θ and ψ, respectively.

^1 (continued) and consequently strong exogeneity in [?] and weak exogeneity coincide.

^2 A related criterion is that of algorithmic independence between the input distribution p(X) and the conditional p(Y|X) postulated for a causal system X → Y [?]; see also [?]. The algorithmic independence condition is defined in terms of Kolmogorov complexity, which is uncomputable, and the method proposed in this paper provides an alternative way to assess the “independence” between p(X) and p(Y|X).

^3 The stronger, causal concept of exogeneity is known as super exogeneity.

Here “variation free” means that the possible values that one parameter set can take do not depend on the values of the other set. Clauses (i.) and (ii.) in Definition 1 are the defining conditions for the concept of a (classical) cut: [(Y|X; ψ), (X; θ)] is said to operate a (classical) cut on p(X, Y | θ, ψ) if (i.) and (ii.) are satisfied. The cut implies that the maximum likelihood estimates of θ and ψ can be computed from p(X | θ) and p(Y | X, ψ), respectively, and so the MLEs θ̂ and ψ̂ are independent according to the sampling distribution. The concept of exogeneity formalizes the idea that the mechanism generating the exogenous variable X does not contain any relevant information about the parameter set ψ for the conditional model p(Y|X).
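To make the separation of estimation explicit, the following worked equation writes out the log-likelihood under clauses (i.) and (ii.). It is an illustrative sketch, assuming i.i.d. observations (x_i, y_i), i = 1, ..., N, and that the maximizers exist; the symbol \ell for the log-likelihood is introduced here only for this illustration.

```latex
\begin{align*}
\ell(\theta, \psi)
  &= \sum_{i=1}^{N} \log p(x_i \mid \theta)
   + \sum_{i=1}^{N} \log p(y_i \mid x_i, \psi), \\
\hat\theta
  &= \arg\max_{\theta \in \Theta} \sum_{i=1}^{N} \log p(x_i \mid \theta),
\qquad
\hat\psi
   = \arg\max_{\psi \in \Psi} \sum_{i=1}^{N} \log p(y_i \mid x_i, \psi).
\end{align*}
```

Because the first sum involves only θ, the second involves only ψ, and the admissible set is the product Θ × Ψ, the joint maximization splits into two separate problems.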

The concept of cut also has a Bayesian version [16].

Definition 2 (Bayesian cut). [(Y|X; ψ), (X; θ)] operates a Bayesian cut on p(X, Y | θ, ψ) if

(i.) ψ and θ are independent a priori, i.e., ψ ⊥ θ,

(ii.) θ is sufficient for the marginal process of generating X, i.e., ψ ⊥ X | θ, and

(iii.) ψ is sufficient for the conditional process of generating Y given X, i.e., θ ⊥ Y | (ψ, X).

A Bayesian cut allows a complete separation of inference (on θ) in the marginal model and of inference (on ψ) in the conditional model. The prior independence between θ and ψ in the Bayesian cut is a counterpart to the variation-free condition in the classical cut, and the last two conditions in Definition 2 imply condition (i.) in Definition 1. Thus, the Bayesian cut is equivalent to the classical cut in sampling theory, and for the purpose of this paper the two can be regarded as interchangeable. Therefore, the exogeneity of X relative to Y can also be defined as the existence of a reparameterization (θ, ψ) of p(X, Y) such that [(Y|X; ψ), (X; θ)] operates a Bayesian cut on p(X, Y | θ, ψ).

2.2 Possible Situations Where the Parameterization Fails to Operate a Bayesian Cut

Fig. 1(a) shows a data-generating process of X and Y where [(Y|X; ψ), (X; θ)] operates a Bayesian cut. Note that in Definition 2, the two requirements of sufficiency of θ and ψ for the marginal and the conditional (conditions (ii.) and (iii.)), respectively, are only restrictive under the assumption of prior independence of θ and ψ (condition (i.)); otherwise, conditions (ii.) and (iii.) can be trivially met by, for example, taking θ and ψ to be the same. In fact, any two conditions in Definition 2 could be trivial, given that the other does not hold. Fig. 1(b-d) shows the situations where conditions (i.), (ii.), and (iii.) are violated, respectively. In all those situations, θ and ψ are not independent a posteriori.

2.3 Relation to Causality 

As Pearl [?] rightly stressed, the EHR concept of weak exogeneity is a statistical rather than a causal notion. Unlike the concept of super exogeneity, it is not defined in terms of interventions or multiple regimes. That is why, as we will show, the hypothesis that X is exogenous relative to Y in the sense we defined is generally testable by observational data. However, it is also linked to causality in that it is arguably a necessary condition for an unconfounded causal relation: if X is a cause of Y and there is no common cause of X and Y, then X is exogenous relative to Y in the sense we defined.^4 This follows from the principle we indicated at the beginning: if X is an unconfounded cause of Y, then the process or mechanism that determines X is separate or independent from the process or mechanism that determines Y given X. The separation of processes ensures the existence of separate parameterizations of the processes, which will then satisfy our definition of exogeneity.

Figure 1: Graphical representation of the data-generating process. (a) [(Y|X; ψ), (X; θ)] operates a Bayesian cut (implying that X and ψ are mutually exogenous). (b), (c), and (d) correspond to three situations where [(Y|X; ψ), (X; θ)] does not operate a Bayesian cut: (b) ψ and θ are dependent a priori, as both of them depend on γ, which is a function of θ or ψ; (c) θ is not sufficient in modeling the marginal distribution of X, where γ is a function of ψ; (d) ψ is not sufficient in modeling the conditional distribution of Y given X, where γ is a function of θ.

We have argued that if X and Y are causally related and unconfounded, the exogeneity property holds for the correct causal direction. Furthermore, if it turns out that there is one and only one direction that admits exogeneity, then the direction for which the exogeneity property holds must be the correct causal direction. This suggests the following approach to inferring the causal direction between X and Y based on some tests of exogeneity, assuming that X and Y are causally related and that there is no common cause of X and Y (or in other words, X and Y form a causally sufficient system): test whether (1) X is exogenous for p(Y|X) and whether (2) Y is exogenous for p(X|Y), and if one of them holds and the other does not, we can infer the causal direction accordingly. Of course it may also turn out that neither (1) nor (2) holds, which will indicate that the assumption of causal sufficiency is not appropriate, or that both (1) and (2) hold, which will indicate that the causal direction in question is not identifiable by our criterion.^5

^4 In this paper we use “unconfounded” to mean the absence of any common cause.

A familiar example of a non-identifiable situation is when X and Y follow a bivariate normal distribution. In that case, as shown by EHR [?], there is a cut [(Y|X; ψ), (X; θ)] in one direction, as well as a cut [(X|Y; ψ), (Y; θ)] in the other. Below we give an example where the causal direction is identifiable based on exogeneity.

An example of an identifiable situation: Linear non-Gaussian case.


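Let X follow a Gaussian mixture model with two Gaussians, X ~ Σ_{i=1}^{2} π_i N(μ_i, σ_i²), where π_i > 0 and π_1 + π_2 = 1, and let Y = c + βX + E where E ~ N(0, σ²). Therefore θ = {π_i, μ_i, σ_i}_{i=1}^{2} and ψ = {c, β, σ²}. We then have

\[
p(X, Y \mid \theta, \psi)
 = \sum_{i} \pi_i\, \mathcal{N}(x; \mu_i, \sigma_i^2)\, \mathcal{N}(y; c + \beta x, \sigma^2)
 = \sum_{i} \pi_i\, \mathcal{N}(y; \tilde\mu_i, \tilde\sigma_i^2)\, \mathcal{N}(x; \tilde c_i + \tilde\beta_i y, \gamma_i^2),
\]

where

\[
\tilde\mu_i = c + \beta\mu_i, \quad
\tilde\sigma_i^2 = \beta^2\sigma_i^2 + \sigma^2, \quad
\tilde\beta_i = \frac{\beta\sigma_i^2}{\beta^2\sigma_i^2 + \sigma^2}, \quad
\tilde c_i = \mu_i - \tilde\beta_i \tilde\mu_i, \quad
\gamma_i^2 = \frac{\sigma_i^2 \sigma^2}{\beta^2\sigma_i^2 + \sigma^2}.
\]

That is,

\[
p(Y \mid \theta, \psi) = \sum_{i=1}^{2} \pi_i\, \mathcal{N}(y; \tilde\mu_i, \tilde\sigma_i^2)
\quad \text{and} \quad
p(X \mid Y, \theta, \psi) = \sum_{i=1}^{2} \frac{\pi_i\, \mathcal{N}(y; \tilde\mu_i, \tilde\sigma_i^2)}{p(Y \mid \theta, \psi)} \cdot \mathcal{N}(x; \tilde c_i + \tilde\beta_i y, \gamma_i^2).
\]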


Clearly, if π_1 π_2 ≠ 0, no matter how one parameterizes the density of Y, the conditional distribution of X given Y would involve those parameters that model the marginal density of Y. The sufficient parameter set of the distribution of Y, θ, and that of the conditional distribution of X given Y, ψ, cannot be variation-free or independent a priori; see Fig. 1(b). Alternatively, one can keep in ψ only those parameters that are independent a priori from θ, i.e., ψ and θ become independent a priori, but ψ is then not sufficient in modeling p(X|Y); see Fig. 1(d). In both situations Y is not exogenous for ψ. Hence in this linear non-Gaussian case the exogeneity condition only holds for the direction X → Y, and the causal direction is identifiable. Fig. 2 gives an intuitive illustration of how the shape of p(Y) and that of E(X|Y), which is determined by p(X|Y), are related.
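The dependence of the backward conditional on the marginal parameters can be checked numerically. The sketch below is an illustration added here, not the authors' code: the function names and the example parameter values are assumptions, and the closed-form expressions are the standard Gaussian conditioning formulas applied per mixture component. It computes E(X|Y = y) for the linear model with a Gaussian-mixture cause.

```python
import numpy as np

def reverse_params(mu, sigma2, c, beta, noise2):
    """Per-component parameters of p(Y) and p(X | Y) for Y = c + beta*X + E."""
    mu, sigma2 = np.asarray(mu, float), np.asarray(sigma2, float)
    mu_t = c + beta * mu                    # mean of Y within component i
    sig_t2 = beta**2 * sigma2 + noise2      # variance of Y within component i
    beta_t = beta * sigma2 / sig_t2         # slope of E[X | Y] within component i
    c_t = mu - beta_t * mu_t                # intercept of E[X | Y] within component i
    gamma2 = sigma2 * noise2 / sig_t2       # variance of X | Y within component i
    return mu_t, sig_t2, beta_t, c_t, gamma2

def normal_pdf(v, mean, var):
    return np.exp(-0.5 * (v - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def posterior_mean_x_given_y(y, pi, mu, sigma2, c, beta, noise2):
    """E[X | Y = y] when X is a Gaussian mixture and the noise E is Gaussian."""
    mu_t, sig_t2, beta_t, c_t, _ = reverse_params(mu, sigma2, c, beta, noise2)
    y = np.atleast_1d(np.asarray(y, float))[:, None]
    w = np.asarray(pi, float) * normal_pdf(y, mu_t, sig_t2)  # unnormalized weights
    w = w / w.sum(axis=1, keepdims=True)                     # divide by p(y)
    return (w * (c_t + beta_t * y)).sum(axis=1)
```

With two well-separated components, changing only the mixture weights (parameters of the cause) changes E(X|Y), whereas a single Gaussian gives a straight line that does not involve any weights.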

3. CAUSAL DIRECTION DETERMINATION 
BY TESTING FOR EXOGENEITY WITH 
BOOTSTRAP 

We now describe our approach to testing exogeneity. We will 
first illustrate how bootstrap can be used to test whether a 
given parametric model constitutes a (Bayesian) cut, and 
then develop a nonparametric test for exogeneity based on 
bootstrap. 

^5 Note that we are not concerned with the case in which X and Y are not causally connected and hence statistically independent; in that case, exogeneity trivially holds in both directions.



Figure 2: An illustration of the identifiability of a linear non-Gaussian model based on “exogeneity”. X is generated by a mixture of two Gaussians, and Y is generated by Y = X + E, where E ~ N(0, 1). Here X is exogenous for the parameters in p_{Y|X}, while Y is not exogenous for the parameters in p_{X|Y}.


3.1 Bootstrap-Based Test for Bayesian Cut in 
the Parametric Case 

In this section, we assume that a parametric form p(X, Y | θ, ψ) = p(X | θ) p(Y | X, ψ) is given. We would like to see whether the estimates of θ and of ψ are independent according to the sampling distribution; in other words, with a noninformative prior, we want to test if the posterior distribution p(θ, ψ | D) has no coupling between θ and ψ. In this case we are examining whether [(Y|X; ψ), (X; θ)] operates a Bayesian cut.
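As a hedged illustration of such a check, the sketch below resamples (x, y) pairs with replacement, re-estimates the parameters on each resample, and summarizes the dependence between the two sets of estimates by their largest absolute correlation. The specific model (a Gaussian marginal with θ = (mean, variance) and a linear-Gaussian conditional with ψ = (intercept, slope, residual variance)), the function names, and the use of a maximum absolute correlation as the summary are all assumptions made for this example.

```python
import numpy as np

def bootstrap_param_estimates(x, y, n_boot=500, seed=0):
    """Paired bootstrap of theta = (mean, var of X) and
    psi = (intercept, slope, residual variance) of a linear fit of Y on X."""
    rng = np.random.default_rng(seed)
    n = len(x)
    theta = np.empty((n_boot, 2))
    psi = np.empty((n_boot, 3))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)          # draw n pairs with replacement
        xb, yb = x[idx], y[idx]
        theta[b] = xb.mean(), xb.var()
        slope, intercept = np.polyfit(xb, yb, 1)  # highest degree first
        resid = yb - (intercept + slope * xb)
        psi[b] = intercept, slope, resid.var()
    return theta, psi

def max_abs_cross_correlation(theta, psi):
    """Largest |correlation| between any theta estimate and any psi estimate."""
    k = theta.shape[1]
    corr = np.corrcoef(np.hstack([theta, psi]), rowvar=False)
    return float(np.abs(corr[:k, k:]).max())
```

If the conditional model is too simple for the data (for example, a linear fit to a quadratic relation), the bootstrap estimates of the conditional parameters tend to move together with those of the marginal, and the summary correlation is far from zero.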


Bootstrap has been used in the literature to assess the dependence, as well as uncertainty, in the parameter estimates according to the sampling distribution; see, e.g., [?, Sec. 5.7]. For clarity, Table 1 gives the notation used in the proposed bootstrap-based method. Suppose we draw bootstrap resamples (x*(b), y*(b)), b = 1, ..., B, from the original sample (x, y) = {(x_i, y_i)}_{i=1}^{N} with paired bootstrap, i.e., each resample (x*(b), y*(b)) is obtained by independently drawing N pairs from the original sample with replacement. On each of them, we can calculate the parameter estimates θ̂*(b) and ψ̂*(b). The independence between θ̂ and ψ̂ according to the sampling distribution is then transformed to statistical independence between the bootstrap estimates θ̂*(b) and ψ̂*(b), b = 1, ..., B. To assess the latter, any independence test method, such as the correlation test, would apply.

3.2 Bootstrap-Based Test for Exogeneity in the Nonparametric Case

Let x be a fixed set of values of X, and x_i be a point in x. x can be drawn from the given data set, or randomly sampled on the support of X, given that it contains enough points such that the values of p(X) and p(Y|X) evaluated at x well approximate the continuous densities. In our experiments we used 80 evenly-spaced sample points between the minimum and maximum values of X as x (so its length is N = 80).

On the bootstrap resamples, log p̂*(b)(X = x) is fully determined by θ̂*(b); similarly, log p̂*(b)(Y | X = x) is a function of ψ̂*(b), and so is the quantity Ĥ*(b)_{Y|x}(x) = E_{Y|x} log p̂*(b)(Y | X = x). Note that p̂*(b)(Y | X = x_i) is the estimated distribution of Y at X = x_i, and hence Ĥ*(b)_{Y|x}(x) can be considered as negative entropies of Y on the bth bootstrap resample evaluated at X = x.

Suppose all involved parameters are identifiable, i.e., the mappings θ ↦ p(X | θ) and ψ ↦ p(Y | X, ψ) are both one-to-one. Then the mapping between θ̂*(b) and log p̂*(b)(X = x) and that between ψ̂*(b) and log p̂*(b)(Y | X = x) are both one-to-one. Hence, the independence between θ̂*(b) and ψ̂*(b), b = 1, ..., B, implies that between log p̂*(b)(X = x) and Ĥ*(b)_{Y|x}(x).

As a consequence, in nonparametric settings, we can imagine that there exist effective parameters θ and ψ, and can still assess whether they admit a Bayesian cut by testing for independence between the bootstrapped estimates log p̂*(b)(X = x) and Ĥ*(b)_{Y|x}(x). Note that in the nonparametric case, the “parameters” θ and ψ are not observable. The previous argument shows that if there exists (θ, ψ) admitting a Bayesian cut, log p̂*(b)(X = x) and Ĥ*(b)_{Y|x}(x) are independent; otherwise they are always dependent. In other words, testing for independence between the bootstrapped estimates log p̂*(b)(X = x) and Ĥ*(b)_{Y|x}(x) is actually a way to assess the exogeneity condition. Algorithm 1 summarizes the proposed procedure to determine the causal direction between X and Y, given the sample (x, y) as input. In particular, it involves the following two modules.


3.2.1 Module 1: Nonparametric Estimators of p(X) and p(Y|X)

When testing for exogeneity, one assumes the (parametric) model is correctly specified. Otherwise, if the model is over-simplified, the estimated conditional distribution will depend on the marginal, which inspires the importance-reweighting scheme to handle learning problems under covariate shift (see, e.g., Footnote 1 in [24]). For example, let us consider the situation where Y depends on X in a nonlinear manner while a linear model is exploited to estimate p_{Y|X}; clearly the estimate of the parameters in the conditional model would depend on that in p_X. To avoid this, we use flexible nonparametric models to estimate the conditional.


Suppose we aim to verify whether X is exogenous for the effective “parameters” in p(Y|X). We need to estimate the marginal distribution p(X) and the conditional distribution p(Y|X) on the original sample as well as on each bootstrap resample. We estimate p(X) with Gaussian kernel density estimation, with the kernel width selected by Silverman’s rule of thumb [22, page 48].
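The marginal estimate can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the bandwidth uses one common form of Silverman's rule of thumb, 0.9 · min(std, IQR/1.34) · n^(-1/5), and the function names are hypothetical. The function returns log p̂(X = x) on a fixed evaluation grid, which is the quantity compared across bootstrap resamples.

```python
import numpy as np

def silverman_bandwidth(x):
    """One common form of Silverman's rule of thumb for a Gaussian kernel."""
    x = np.asarray(x, dtype=float)
    std = x.std(ddof=1)
    iqr = np.subtract(*np.percentile(x, [75, 25]))   # 75th minus 25th percentile
    spread = min(std, iqr / 1.34) if iqr > 0 else std
    return 0.9 * spread * x.size ** (-0.2)

def log_kde(x_train, grid, bandwidth=None):
    """Log of a Gaussian kernel density estimate, evaluated at each grid point."""
    x_train = np.asarray(x_train, dtype=float)
    grid = np.asarray(grid, dtype=float)
    h = silverman_bandwidth(x_train) if bandwidth is None else bandwidth
    z = (grid[:, None] - x_train[None, :]) / h
    dens = np.exp(-0.5 * z**2).sum(axis=1) / (x_train.size * h * np.sqrt(2 * np.pi))
    return np.log(dens)
```

In the bootstrap loop one would call log_kde on each resample with the same grid, so that the resulting vectors are comparable across resamples.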

To estimate the conditional density p(Y|X), we adapted the method originally proposed for causal inference based on the structural equation Y = f(X, E) [?]. This method









Table 1: Notation involved in the proposed method based on exogeneity and bootstrap

Notation               Meaning
(x, y)                 given sample of (X, Y)
(x*(b), y*(b))         bth bootstrap resample
θ̂*(b), ψ̂*(b)           estimates of parameters θ and ψ on (x*(b), y*(b))
p̂*(b)(X = x)           marginal densities estimated on (x*(b), y*(b)), evaluated at X = x
p̂*(b)(Y | X = x)       conditional densities estimated on (x*(b), y*(b)), evaluated at X = x
Ĥ*(b)_{Y|x}(x)         quantity associated with (Y | X = x), defined as E_{Y|x} log p̂*(b)(Y | X = x) on (x*(b), y*(b))


Algorithm 1 Finding causal direction between X and Y based on exogeneity
Input: data (x, y)
Output: three possibilities: causal direction between X and Y, or non-identifiable causal direction by exogeneity, or existence of hidden confounders
  If_Exogeneity(X → Y)
  If_Exogeneity(Y → X)
  if exogeneity holds for only one direction then
    return the direction in which exogeneity holds
  else if exogeneity holds for both directions then
    print non-identifiable causal direction by exogeneity
  else    ▷ exogeneity does not hold in either direction
    print confounder case
  end if

  procedure If_Exogeneity(X → Y)
    for b = 1 to B do
      draw bootstrap resample (x*(b), y*(b)) by random sampling with replacement from (x_i, y_i);
      estimate log p̂*(b)(X = x) and Ĥ*(b)_{Y|x}(x) with the methods given in Sec. 3.2.1
    end for
    test for independence between log p̂*(b)(X = x) and Ĥ*(b)_{Y|x}(x), b = 1, ..., B, with the method given in Sec. 3.2.2
    return independence test result
  end procedure
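The decision step of Algorithm 1 can be sketched as follows. This is a minimal illustration, not the authors' code: the function name decide_direction, the returned string labels, and the default significance level alpha = 0.01 (the threshold mentioned for the real-data experiments) are assumptions. The two inputs are taken to be the p-values of the independence test for the two candidate directions.

```python
def decide_direction(p_x_to_y, p_y_to_x, alpha=0.01):
    """Map the two independence-test p-values to one of the three outcomes.

    Exogeneity is taken to hold for a direction when independence
    is NOT rejected at level alpha (hypothetical default).
    """
    exo_xy = p_x_to_y >= alpha
    exo_yx = p_y_to_x >= alpha
    if exo_xy and not exo_yx:
        return "X -> Y"
    if exo_yx and not exo_xy:
        return "Y -> X"
    if exo_xy and exo_yx:
        return "non-identifiable by exogeneity"
    return "possible confounder"
```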




aims to find the functional causal model Y = f(X, E), where E ⊥ X, given (x, y). Without loss of generality, one can assume that E ~ N(0, 1). (Otherwise, one can always write E = g(Ẽ) where g is some appropriate function and Ẽ ~ N(0, 1), and use the functional causal model Y = f(X, g(Ẽ)) instead.) Here f is completely nonparametric: it takes a Gaussian process prior with zero mean function and covariance function k((x, e), (x', e')), where k is a Gaussian kernel, and (x, e) and (x', e') are two points of (X, E). As in [?], this method optimizes the values of E, denoted by ê, as well as the involved hyperparameters, and produces the maximum a posteriori (MAP) solution of f, by maximizing the approximate marginal likelihood. The functional causal model implies the conditional density:

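\[
p(Y \mid X) = \frac{p(X, Y)}{p(X)}
 = \frac{p(X, E) \big/ \left| \frac{\partial f}{\partial E} \right|}{p(X)}
 = \frac{p(E)}{\left| \frac{\partial f}{\partial E} \right|},
\]

where the last equality uses E ⊥ X. Finally, once we have the ê_i and the estimate of f, the conditional density at each point can be estimated as

\[
p(Y = y_i \mid X = x_i) = \frac{p(E = \hat e_i)}{\left| \frac{\partial f}{\partial E}(x_i, \hat e_i) \right|}.
\]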
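The change-of-variables formula can be checked on a toy model. In the sketch below, which is an illustration only, f is a known function rather than a Gaussian-process estimate, the derivative is approximated by a central finite difference, and the function name and the step size eps are assumptions.

```python
import numpy as np

def conditional_density_from_noise(x, e, f, eps=1e-6):
    """p(Y = f(x, e) | X = x) = p(E = e) / |df/dE|, with E ~ N(0, 1).

    f must be a function of (x, e) that is monotone in e near the given point.
    """
    dfde = (f(x, e + eps) - f(x, e - eps)) / (2 * eps)   # numerical derivative in e
    p_e = np.exp(-0.5 * e**2) / np.sqrt(2 * np.pi)       # standard normal density
    return p_e / abs(dfde)
```

For f(x, e) = 2x + (1 + x²)e, the conditional of Y given X = x is Gaussian with mean 2x and standard deviation 1 + x², so the value returned by the function can be compared with that Gaussian density directly.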


3.2.2 Module 2: Testing for Independence Between 
High-Dimensional Vectors 

The task is then to test for independence between the estimated quantities on the bootstrap resamples, log p̂*(b)(X = x) and Ĥ*(b)_{Y|x}(x), b = 1, ..., B. Their dimensions are the number of data points in x, which is 80 in our experiments.

Let R be the matrix consisting of the centered version of log p̂*(b)(X = x_i), obtained on all bootstrap resamples, i.e., the (i, b)th entry of R is

\[
R_{ib} = \log \hat p^{*(b)}(X = x_i) - \frac{1}{B} \sum_{k=1}^{B} \log \hat p^{*(k)}(X = x_i).
\]

Similarly, S contains the centered version of Ĥ*(b)_{Y|x}(x_i), i.e.,

\[
S_{ib} = \hat H^{*(b)}_{Y|x}(x_i) - \frac{1}{B} \sum_{k=1}^{B} \hat H^{*(k)}_{Y|x}(x_i).
\]

Both R and S are of size N × B. We define the statistic as C_{X→Y} = Tr((RS^T)(RS^T)^T) = Tr(R^T R · S^T S), which is, up to a constant factor, the sum of squares of the covariances between all rows of R and those of S. The distribution of this statistic under the null hypothesis that log p̂*(b)(X = x) and Ĥ*(b)_{Y|x}(x) are independent can then be constructed by a permutation test.

Note that this statistic is actually the Hilbert-Schmidt independence criterion (HSIC) [?] with a linear kernel. That is, we care about linear dependence between log p̂*(b)(X = x) and Ĥ*(b)_{Y|x}(x); this is reasonable because they are in the vicinity of the maximum likelihood estimates and their dependence can be captured by linear approximation. On the other hand, if we use HSIC with Gaussian kernels, the result will be sensitive to the kernel width because the data dimension (the number of rows of R and S) is high.
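The statistic and its permutation null can be sketched as follows. This is an illustrative implementation under assumptions, not the authors' code: the function names, the number of permutations, and the add-one p-value convention are choices made for this example. Rows index evaluation points and columns index bootstrap resamples, as for R and S above.

```python
import numpy as np

def linear_hsic_statistic(R, S):
    """Tr((R S^T)(R S^T)^T) after centering each row across resamples."""
    R = R - R.mean(axis=1, keepdims=True)
    S = S - S.mean(axis=1, keepdims=True)
    C = R @ S.T                       # cross-products between rows of R and rows of S
    return float(np.sum(C ** 2))

def permutation_pvalue(R, S, n_perm=500, seed=0):
    """Permute the resample index of S to break any pairing with R."""
    rng = np.random.default_rng(seed)
    observed = linear_hsic_statistic(R, S)
    B = R.shape[1]
    null = np.array([
        linear_hsic_statistic(R, S[:, rng.permutation(B)])
        for _ in range(n_perm)
    ])
    return (1 + np.sum(null >= observed)) / (1 + n_perm)
```

A small p-value means the paired bootstrap quantities are more strongly (linearly) related than under random re-pairing, which counts as evidence against exogeneity in the tested direction.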

4. EXPERIMENTS 

In this section we first evaluate the behavior of the proposed bootstrap-based method for causal inference with synthetic data, on which the ground truth is known, and then apply it to real data. We use two variables, and with synthetic data, we examine both the case where the two variables have a direct causal relation and the confounder case (i.e., there are confounders influencing both of them). We compare the proposed bootstrap-based approach with the additive noise model (ANM) proposed in [?], GPI [15], and the information-geometric causal inference (IGCI) approach [10]: ANM assumes that the effect is a nonlinear function of the cause plus additive noise, GPI applies the Gaussian process prior on the generating function, and IGCI assumes the transformation from the cause to the effect is deterministic, nonlinear, and independent from the distribution of the cause in a certain way. For computational reasons, we used 1000 bootstrap replications.

Simulation: Without Confounders. Inspired by the settings in [?], we generated the simulated data with the model Y = (X + bX³)e^{αE} + (1 − α)E, where X and E were obtained by passing i.i.d. Gaussian samples through power nonlinearities with exponent q, while keeping the original signs. The parameter α controls the type of the observation noise, ranging from purely additive noise (α = 0) to purely multiplicative noise (α = 1). b determines how nonlinear the effect of X is, and when b = 0 the model is linear. The parameter q controls the non-Gaussianity of X and E: q = 1 corresponds to a Gaussian distribution, and q > 1 and q < 1 produce super-Gaussian and sub-Gaussian distributions, respectively.
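The generating model above can be sketched as follows. This is a hedged illustration: the function names and the seed handling are assumptions, and any rescaling or standardization used in the actual experiments is not stated in the text and is therefore not included.

```python
import numpy as np

def power_nonlinearity(z, q):
    """Raise |z| to the power q while keeping the original sign."""
    return np.sign(z) * np.abs(z) ** q

def simulate_pair(n, q=1.0, b=0.0, alpha=0.0, seed=0):
    """Draw (x, y, e) with y = (x + b*x^3) * exp(alpha*e) + (1 - alpha) * e."""
    rng = np.random.default_rng(seed)
    x = power_nonlinearity(rng.standard_normal(n), q)
    e = power_nonlinearity(rng.standard_normal(n), q)
    y = (x + b * x**3) * np.exp(alpha * e) + (1 - alpha) * e
    return x, y, e
```

For example, simulate_pair(500, q=1.0, b=0.0, alpha=0.0) gives the linear Gaussian additive-noise case, in which the direction is not expected to be identifiable by exogeneity.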

We considered three situations, in each of which two of q, b, and α were fixed and we examined how the third changes the performance of the different methods. For each combination of q, b, and α, we independently simulated 10 data sets with 500 data points.^6 Fig. 3 shows the accuracy of the considered methods. One can see that the accuracy of the bootstrap-based approach is among or close to the best results, indicating that it is able to perform causal inference in various situations. We note that in practice, the performance of the bootstrap-based approach depends on the number of bootstrap replications and the method used for conditional distribution estimation. Although, for computational reasons, we did not try a larger number of bootstrap replications, generally speaking, the accuracy of the bootstrap-based method improves as the number of replications increases.

Simulation: With Confounders. We then include the confounder variable Z in the system, so that the causal structure is Z → X and (Z, X) → Y. For simplicity, we assume that both X and Y are influenced by Z in a linear form: X = (2 − β)E_X + βZ, and Y = 0.3(2 − β)[(X + bX³)e^{αE} + (1 − α)E] + βZ, where E_X, Z, and E were obtained by passing i.i.d. Gaussian samples through power nonlinearities with exponent q = 1.5, and β controls how strong the effect of Z is on both X and Y. We considered two situations: in one of them, we set α = 0 and b = 0, i.e., the whole model is linear; in the other situation, α = 0.2 and b = 0.3, so the model contains both additive noise and multiplicative noise. We changed β from 0 to 1, and Fig. 4 shows the performances of the four methods in the two situations; note that for each value of β, the four bars (from left to right) correspond to the bootstrap-based method, GPI, IGCI, and ANM. In particular, one can see that the bootstrap-based method tends to detect the presence of the confounder when its effect is significant.
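A corresponding sketch of the confounded generating model is given below. As before, this is an illustration with assumed names and seed handling, and it does not include any preprocessing that the experiments may have applied.

```python
import numpy as np

def simulate_confounded_pair(n, beta, b=0.0, alpha=0.0, q=1.5, seed=0):
    """Draw (x, y, z, e_x, e) from the linear-confounder model described above."""
    rng = np.random.default_rng(seed)

    def draw():
        g = rng.standard_normal(n)
        return np.sign(g) * np.abs(g) ** q      # power nonlinearity, sign kept

    e_x, z, e = draw(), draw(), draw()
    x = (2 - beta) * e_x + beta * z
    effect = (x + b * x**3) * np.exp(alpha * e) + (1 - alpha) * e
    y = 0.3 * (2 - beta) * effect + beta * z
    return x, y, z, e_x, e
```

Setting beta = 0 removes the influence of Z on both variables, while beta = 1 gives the strongest confounding considered in the text.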

On Real Data. We applied the bootstrap-based method 

on the cause-effect pairs available at 

http://webdav.tuebingen.mpg.de/cause-effect/ 

To reduce computational load, we used at most 500 points 
for each cause-effect pair. On 20 pairs (pairs 21, 43, 45, 48- 
51, 56-58, 61-64, 72, 75, 77-79, and 81), the p-values of the 

®Since the bootstap-based approach is rather time- 
consuming, we only simulated 10 data sets for each setting. 



(a) Ghanging a: From additive to multiplicative noise 



(b) Changing q\ From sub-Gaussian to super-Gaussian 
additive noise 



(c) Changing 6: Various nonlinear functions with Gaussian 
additive noise 

Figure 3: Accuracy of correctly estimating the 
causal direction for different generating models: (a) 
g = 1, 6 = 1, and a changed from 0 to 1, (b) for a 
linear function (6 = 0) with additive noise (a = 0) 
which changed from sub-Gaussian {q < 1) to sub- 
Gaussian {q> 1), and (c) various nonlinear functions 
(6 changed from -1 to 1) with additive Gaussian noise 
(q = 1, o = 0). 

























I Correct direction I IConfounder Wrong direction 



\beta = 0 0.25 0.50 0.75 1.00 
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(a) Situation 1: Linear confounder case. 



(b) Situation 2: Nonlinear confounder case. 


their statistical properties. 

Our approach shows that it is possible to determine causal 
direction without structural constraints or a specific type of 
smoothness assumptions on the functional models. The pro¬ 
posed computational approach successfully demonstrated the 
validity of this idea, though it is computationally demand¬ 
ing because of the bootstrap procedure and its performance 
is not necessarily the best among existing methods. At the 
same time, it enjoys some advantages. First, it does not 
make a strong assumption on the data-generating process. 
Second, it could often tell us if significant confounders exist. 
The performance of the proposed bootstrap-based approach 
depends on the number of bootstrap replications and the 
method for conditional distribution estimation. In future 
work we aim to develop more reliable methods along this 
line, including methods that can handle more than two vari¬ 
ables. 


In this paper we made an attempt to discover causal in¬ 
formation from observational data based on a condition of 
exogeneity, which provides another perspective to concep¬ 
tualize the ’’independence” between the process generating 
the cause and that generating the effect from cause. On 
the other hand, it is worth mentioning that this type of in¬ 
dependence is able to facilitate understanding and solving 
some machine learning or data analysis problems. For in¬ 
stance, it helps understand when unlabeled data points will 
help in the semi-supervised learning scenario 20 , and in¬ 
spired new settings and formulations for domain adaptation 
by characterizing what information to transfer and how to 
do so (281 [251. 


Figure 4: Number of replications in which the meth¬ 
ods find correct directions, report existence of con- 
founders, and give wrong directions, respectively. 

For each value of p, the four bars correspond to bootstrap- 
based method, GPI, IGCI, and ANM (from left to right). 

independence test for both directions are smaller than 0.01, indicating that there might be significant confounders. This seems reasonable, as the data scatter plots for these pairs indicate that the two variables have complex dependence relationships. On the remaining 57 data sets, the bootstrap-based method output correct causal directions on 41 of them (an accuracy of 72%). We also applied recently proposed causal inference approaches, including IGCI [10], the approach based on the Gaussian process prior [15], and that based on the post-nonlinear causal model [27], to those 57 data sets for comparison. Their performance was similar: the three approaches gave correct causal directions on 41, 40, and 43 pairs, respectively.

5. CONCLUSION AND DISCUSSIONS 

We proposed to do causal inference based on the criterion of 
exogeneity of the cause for the parameters in the conditional 
distribution of the effect given the cause. We discussed how 
to assess such exogeneity in nonparametric settings. To this 
end, one needs to draw a number of samples according to the 
unknown data-generating process. Fortunately, the bootstrap
provides a way to mimic the data-generating process
from which we can draw a number of samples and analyze 
Supplement to

“Distinguishing Cause from Effect Based on Exogeneity” 


This supplementary material provides the proofs and discussions that are omitted from the submitted paper. The equation numbers in this material are consistent with those in the paper.


S1. Mutual Exogeneity and Its Relationship to Definition 1

There are two types of analysis of exogeneity: one considers inference based on the complete sample results, and the other considers dynamic models where the data were obtained by “sequential sampling”. In this paper we focus on the former scenario.


The following theorem, extracted from [?], relates the Bayesian cut to the independence of the parameters according to the posterior distribution, as well as to mutual exogeneity.

Theorem 4. Suppose $[\psi, (X, \theta)]$ operates a Bayesian cut in $p(X, Y, \{\psi, \theta\})$; then

(i) $X$ and $\psi$ are mutually exogenous, and
(ii) $\psi$ and $\theta$ are independent a posteriori.

On the other hand, if $X$ and $\psi$ are mutually exogenous and if $\theta \perp\!\!\!\perp \psi \mid X$, then $[\psi, (X, \theta)]$ operates a Bayesian cut.
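Part (ii) of Theorem 4 can be seen from a short factorization. The sketch below assumes the usual three Bayesian-cut conditions, which are not restated in this supplement and are therefore an assumption here: prior independence $\theta \perp\!\!\!\perp \psi$, sufficiency of $\theta$ for the marginal process, $\psi \perp\!\!\!\perp X \mid \theta$, and sufficiency of $\psi$ for the conditional process, $\theta \perp\!\!\!\perp Y \mid (\psi, X)$.

```latex
\begin{align*}
p(X, Y, \theta, \psi)
  &= p(\theta)\, p(\psi \mid \theta)\, p(X \mid \theta, \psi)\, p(Y \mid X, \theta, \psi) \\
  &= p(\theta)\, p(\psi)\, p(X \mid \theta)\, p(Y \mid X, \psi), \\
p(\theta, \psi \mid X, Y)
  &\propto \bigl[\, p(\theta)\, p(X \mid \theta) \,\bigr] \cdot \bigl[\, p(\psi)\, p(Y \mid X, \psi) \,\bigr].
\end{align*}
```

Under these assumptions the posterior is a product of a factor in $\theta$ alone and a factor in $\psi$ alone, so it equals $p(\theta \mid X)\, p(\psi \mid X, Y)$ and the two parameter sets are independent a posteriori.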


From the Bayesian point of view, exogeneity of $X$ for $\psi$ allows an admissible reduction of the complete model $p(X, Y \mid \theta, \psi)$ to the conditional model $p(Y \mid X, \psi)$, in that both models lead to the same posterior distribution on the parameter set $\psi$ [16]. Below we give the definition of mutual exogeneity according to [?].


Definition 3 (Mutual exogeneity). $X$ and $\psi$ are mutually exogenous if and only if

(i) $\psi$ and $X$ are independent, i.e., $\psi \perp\!\!\!\perp X$, and
(ii) $\psi$ is sufficient in the conditional distribution of $Y$ given $X$, i.e., $\theta \perp\!\!\!\perp Y \mid (\psi, X)$.


When one (or more) of the conditions in the definition of the Bayesian cut is violated, $[\psi, (X, \theta)]$ does not operate a Bayesian cut, i.e., $X$ is not exogenous for $\psi$. Fig. [?](b-d) shows the situations where conditions (i), (ii), and (iii) are violated, respectively, so that $[\psi, (X, \theta)]$ does not operate a Bayesian cut. Note that by reparameterization, the three situations can reduce to each other. Take situations (b) and (c) as an example. If we divide $\theta$ in (b) into $(\theta_{\not\perp}, \theta_{\perp})$, where $\theta_{\not\perp}$ depends on $\gamma$ while $\theta_{\perp}$ does not, and consider $\theta_{\perp}$ as the new $\theta$, (b) becomes (c). Similarly, if we merge $\gamma$ and $\theta$ in (c) as the new $\theta$, we then have (b). In all those situations, $\theta$ and $\psi$ are not independent a posteriori, or the maximum likelihood estimates $\hat{\theta}$ and $\hat{\psi}$ are not independent according to the sampling distribution.


Here condition (i) concerns the independence between $\psi$ and $X$; those two quantities play different roles in the model $p(X, Y, \psi, \theta)$, and consequently this independence condition is usually not convenient to verify. Moreover, for the same reason, there is no fully equivalent concept in sampling theory (it is weaker than exogeneity as defined in Definition 1 because the property of $\theta$ is not specified). A natural way of obtaining the mutual exogeneity of $X$ and $\psi$ is to exploit a stronger but more operational condition, namely the condition of the Bayesian cut.

A Bayesian cut allows a complete separation of inference (on parameters $\theta$) in the marginal distribution and of inference (on $\psi$) in the conditional one. The prior independence between $\theta$ and $\psi$ in the Bayesian cut is a counterpart to the variation-free condition in the classical cut (condition (ii) in Definition 1), and the last two conditions in the definition of the Bayesian cut imply condition (i) in Definition 1. Thus, the Bayesian cut is equivalent to the classical cut in sampling theory, and consequently characterizes the exogeneity property defined in Definition 1. Therefore, hereafter the exogeneity of $X$ for $\psi$ is used interchangeably with the statement that $[\psi, (X, \theta)]$ operates a Bayesian cut in $p(X, Y, \theta, \psi)$.


S2. Relation to SEM-Based Causal Inference 
S2.1. Relation to Causal Inference Based on Marginal Likelihood

Recently, SEM-based approaches have demonstrated their power for causal inference in real-world problems. Structural equations represent the effect as a function of the causes and independent noise, which, from another point of view, provides a way to represent the conditional distribution P(effect|cause), or the causal mechanism. The generation of the cause-effect pair consists of two stages, one generating the cause according to P(cause) and the other generating the effect from the value of the cause according to the structural equation. The “simplicity” constraints on the functions (e.g., linearity in [?], additive noise in [?], the post-nonlinear process in [27], and the smoothness assumption in [15]) are crucial. On the one hand, they make the models asymmetric in cause and effect; otherwise, for any two variables, we can always represent one of the variables as a function of the other and an independent noise term [?]. On the other hand, if the functions are constrained to be simple, the independence between the cause and the error terms would imply the exogeneity of the cause for the parameters in P(effect|cause), as suggested by the error-based definition of exogeneity [17] (see also [18]).


The concept of “exogeneity” provides theoretical support for the SEM-based causal inference methods that find the causal direction by comparing the marginal likelihood of the models in the two directions; for an example of such methods, see [15]. One candidate model is given in Fig. [?](a), where $X$ is exogenous for $\psi$ (or, equivalently, $[\psi, (X, \theta)]$ operates a Bayesian cut in $p(X, Y, \theta, \psi)$), denoted by $M_1$. The other corresponds to the factorization

$$
p(X, Y \mid \theta, \psi) = p(Y \mid \tilde{\theta})\, p(X \mid Y, \tilde{\psi}), \qquad (2)
$$

where $[\tilde{\psi}, (Y, \tilde{\theta})]$ operates a Bayesian cut in $p(Y, X, \tilde{\theta}, \tilde{\psi})$, denoted by $M_2$. Note that under the above models, the marginal likelihood of $(X, Y)$ is the product of that of the conditioning variable and that of the conditional distribution. Ideally, if all the involved distributions are correctly specified, one would prefer the causal direction $X \to Y$ (resp. $Y \to X$) if $M_1$ (resp. $M_2$) gives a higher marginal likelihood.


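Theorem 5. Suppose that the two random variables $X$ and $Y$ are generated according to $M_1$, and that the exogeneity-based causal model is identifiable. Let the prior distributions of the parameters be $p^*(\psi \mid M_1)$ and $p^*(\theta \mid M_1)$. For the given sample $(\mathbf{X}, \mathbf{Y})$, let $p(\mathbf{X}, \mathbf{Y} \mid M_1)$ be the marginal likelihood, i.e.,

$$
\begin{aligned}
p(\mathbf{X}, \mathbf{Y} \mid M_1)
&= \int \prod_{i=1}^{N} p(X_i, Y_i \mid \{\psi, \theta\})\, p^*(\theta \mid M_1)\, p^*(\psi \mid M_1)\, d\theta\, d\psi \\
&= \int \prod_{i=1}^{N} p(X_i \mid \theta)\, p^*(\theta \mid M_1)\, d\theta \cdot \int \prod_{i=1}^{N} p(Y_i \mid X_i, \psi)\, p^*(\psi \mid M_1)\, d\psi \\
&= \prod_{i=1}^{N} p(X_i \mid M_1) \cdot \prod_{i=1}^{N} p(Y_i \mid X_i, M_1).
\end{aligned}
$$

Assume that by a one-to-one reparametrization we can represent $p(X, Y \mid \{\psi, \theta\})$ as $p(Y \mid \tilde{\theta})\, p(X \mid Y, \tilde{\psi})$, where $Y$ is not exogenous for $\tilde{\psi}$. Let $p(\mathbf{X}, \mathbf{Y} \mid M_2)$ be the marginal likelihood of $M_2$, i.e.,

$$
\begin{aligned}
p(\mathbf{X}, \mathbf{Y} \mid M_2)
&= \int \prod_{i=1}^{N} p(X_i, Y_i \mid \{\tilde{\psi}, \tilde{\theta}\})\, p^{\circ}(\tilde{\theta} \mid M_2)\, p^{\circ}(\tilde{\psi} \mid M_2)\, d\tilde{\theta}\, d\tilde{\psi} \\
&= \prod_{i=1}^{N} p(Y_i \mid M_2) \cdot \prod_{i=1}^{N} p(X_i \mid Y_i, M_2),
\end{aligned}
$$

where $\tilde{\theta}$ and $\tilde{\psi}$ have independent priors. As the sample size $N$ goes to infinity, for any choice of $p^{\circ}(\tilde{\theta} \mid M_2)$ and $p^{\circ}(\tilde{\psi} \mid M_2)$, $p(\mathbf{X}, \mathbf{Y} \mid M_1)$ is always greater than $p(\mathbf{X}, \mathbf{Y} \mid M_2)$.

Proof. As the data were generated according to model $M_1$, we have

$$
E \log p(X, Y \mid M_1) = \int p(X, Y \mid M_1) \log p(X, Y \mid M_1)\, dx\, dy.
$$

Furthermore,

$$
\begin{aligned}
E \log p(X, Y \mid M_1) - E \log p(X, Y \mid M_2)
&= \int p(X, Y \mid M_1) \log \frac{p(X, Y \mid M_1)}{p(X, Y \mid M_2)}\, dx\, dy \\
&= \mathcal{D}\bigl( p(X, Y \mid M_1) \,\|\, p(X, Y \mid M_2) \bigr),
\end{aligned}
$$

where $\mathcal{D}(\cdot \,\|\, \cdot)$ denotes the Kullback-Leibler divergence. Clearly the above quantity is non-negative, and it is zero if and only if $p(X, Y \mid M_1) = p(X, Y \mid M_2)$ for all possible $x$ and $y$. However, this condition cannot hold, because the model $M_1$ is assumed to be identifiable based on exogeneity.

Consequently, we have $E \log p(X, Y \mid M_1) > E \log p(X, Y \mid M_2)$. Moreover, according to the weak law of large numbers, as $N \to \infty$, $\frac{1}{N} \log p(\mathbf{X}, \mathbf{Y} \mid M_1)$ and $\frac{1}{N} \log p(\mathbf{X}, \mathbf{Y} \mid M_2)$ will converge in probability to the quantities $E \log p(X, Y \mid M_1)$ and $E \log p(X, Y \mid M_2)$, respectively. That is, if $N$ is large enough, $p(\mathbf{X}, \mathbf{Y} \mid M_1) > p(\mathbf{X}, \mathbf{Y} \mid M_2)$. □

However, the marginal likelihood depends heavily on the models or assumptions for the marginal and conditional distributions. Besides the exogeneity property, such approaches also make additional assumptions about the functions, such as structural constraints [21] and the smoothness assumption [15]. The proposed approach avoids such assumptions by directly assessing the exogeneity property.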


Footnote: An error-based definition of exogeneity was given by [17] (see also [18]): $X$ is said to be exogenous for the parameters in $p(Y \mid X)$ if $X$ is independent of all errors that influence $Y$, except those mediated by $X$. We know that without appropriate constraints on the functions, given any two random variables, we can always represent one of them as a function of the other variable and an independent noise term [?], i.e., the functional causal models are not identifiable. Therefore, generally speaking, the above error-based definition is consistent with Definition 1 only when the functional class is well constrained. Otherwise, if the function and the distribution of the assumed cause are related in some way, the above definition is not rigorous.

Footnote: Note that due to computational difficulties, the method of [15] does not evaluate the marginal likelihood, but approximates it with the maximum regularized likelihood.
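The convergence step in the proof of Theorem 5 (the per-sample log-likelihood gap tends to a Kullback-Leibler divergence, which is strictly positive when the two models differ) can be checked numerically on a toy case. The sketch below is an illustration under strong simplifying assumptions: two fixed Gaussian densities stand in for the marginal likelihoods under $M_1$ and $M_2$, and the function names and the alternative mean `mean_alt` are hypothetical.

```python
import numpy as np

def log_normal_pdf(x, mean, std=1.0):
    """Log density of a univariate Gaussian."""
    return -0.5 * np.log(2.0 * np.pi * std**2) - 0.5 * ((x - mean) / std) ** 2

def average_log_likelihood_gap(n_samples, mean_alt=0.5, seed=0):
    """(1/N) * [log-likelihood under the true model - log-likelihood under the
    alternative], with data drawn from the true model N(0, 1)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(loc=0.0, scale=1.0, size=n_samples)
    gap = log_normal_pdf(x, 0.0) - log_normal_pdf(x, mean_alt)
    return float(gap.mean())

# Closed form for unit-variance Gaussians: KL(N(0,1) || N(m,1)) = m^2 / 2.
kl_exact = 0.5 * 0.5**2
for n in (100, 10_000, 1_000_000):
    print(n, average_log_likelihood_gap(n), kl_exact)
```

As the sample size grows, the averaged gap settles near the exact divergence, which is the mechanism the proof relies on; the real marginal likelihoods integrate over parameters and are not covered by this toy check.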


S2.1.1. A Simple Illustration on Parametric Models with Laplace Approximation

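Here we use a somewhat oversimplified parametric example to illustrate why the marginal likelihood implies the causal direction. Assume that $M_1$ holds, that is, in the factorization $p(X, Y \mid \theta, \psi) = p(X \mid \theta)\, p(Y \mid X, \psi)$, $X$ is exogenous for $\psi$. We will demonstrate that the likelihood for model (2) would be asymptotically smaller if we wrongly assume that $Y$ is exogenous for $\tilde{\psi}$. We assume that there is a one-to-one correspondence between $(\theta, \psi)$ and $(\tilde{\theta}, \tilde{\psi})$. As seen from the proof of Theorem [?], the marginal distribution under $M_1$ would be the same as that under factorization (2) with the dependence between $\tilde{\theta}$ and $\tilde{\psi}$ taken into account. Suppose that the corresponding log marginal likelihood, $\log p(\mathbf{X}, \mathbf{Y} \mid M_1)$, can be evaluated with the Laplace approximation in terms of $(\tilde{\theta}, \tilde{\psi})$:

$$
\log p(\mathbf{X}, \mathbf{Y} \mid M_1) \approx \log p(\mathbf{X}, \mathbf{Y} \mid \hat{\tilde{\theta}}, \hat{\tilde{\psi}}) + \log p^{\circ}(\hat{\tilde{\theta}}, \hat{\tilde{\psi}}) - \frac{1}{2} \log \bigl| \Sigma_{\tilde{\theta}, \tilde{\psi}} \bigr| + \frac{d}{2} \log(2\pi),
$$

where $\hat{\tilde{\theta}}$ and $\hat{\tilde{\psi}}$ are the maximum a posteriori (MAP) estimates, $p^{\circ}(\tilde{\theta}, \tilde{\psi})$ is the prior, $\Sigma_{\tilde{\theta}, \tilde{\psi}}$ is the negative Hessian of $\log[\, p(\mathbf{X}, \mathbf{Y} \mid \tilde{\theta}, \tilde{\psi})\, p^{\circ}(\tilde{\theta})\, p^{\circ}(\tilde{\psi}) \,]$ evaluated at $(\hat{\tilde{\theta}}, \hat{\tilde{\psi}})$, and $d$ is the number of parameters.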











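On the other hand, under $M_2$, the negative Hessian matrix becomes block-diagonal, $\mathrm{diag}(\Sigma_{\tilde{\theta}}, \Sigma_{\tilde{\psi}})$, and shares the same main-diagonal blocks $\Sigma_{\tilde{\theta}}$ and $\Sigma_{\tilde{\psi}}$ with $\Sigma_{\tilde{\theta}, \tilde{\psi}}$. We then have

$$
\begin{aligned}
\log p(\mathbf{X}, \mathbf{Y} \mid M_1) - \log p(\mathbf{X}, \mathbf{Y} \mid M_2)
&\approx \frac{1}{2} \Bigl( \log \bigl| \mathrm{diag}(\Sigma_{\tilde{\theta}}, \Sigma_{\tilde{\psi}}) \bigr| - \log \bigl| \Sigma_{\tilde{\theta}, \tilde{\psi}} \bigr| \Bigr) \\
&= \frac{1}{2} \Bigl( \log \bigl| \Sigma_{\tilde{\theta}} \bigr| + \log \bigl| \Sigma_{\tilde{\psi}} \bigr| - \log \bigl| \Sigma_{\tilde{\theta}, \tilde{\psi}} \bigr| \Bigr).
\end{aligned}
$$

One can show that $|\Sigma_{\tilde{\theta}, \tilde{\psi}}| < |\Sigma_{\tilde{\theta}}| \cdot |\Sigma_{\tilde{\psi}}|$ if $\Sigma_{\tilde{\theta}, \tilde{\psi}}$ is not block-diagonal; for a proof, see [?, page 239]. Hence, we have $\log p(\mathbf{X}, \mathbf{Y} \mid M_1) > \log p(\mathbf{X}, \mathbf{Y} \mid M_2)$ asymptotically.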
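The determinant inequality behind the Laplace-approximation argument above can be illustrated numerically. The following sketch is only an illustration: the function name `laplace_log_evidence_gap`, the randomly generated positive-definite matrix, and the split into two parameters per block are assumptions, not quantities from the paper.

```python
import numpy as np

def laplace_log_evidence_gap(neg_hessian, n_theta):
    """0.5 * (log|S_theta| + log|S_psi| - log|S|) for a symmetric
    positive-definite negative Hessian S whose first n_theta rows/columns
    belong to theta and the rest to psi."""
    s_theta = neg_hessian[:n_theta, :n_theta]
    s_psi = neg_hessian[n_theta:, n_theta:]
    _, logdet_full = np.linalg.slogdet(neg_hessian)
    _, logdet_theta = np.linalg.slogdet(s_theta)
    _, logdet_psi = np.linalg.slogdet(s_psi)
    return 0.5 * (logdet_theta + logdet_psi - logdet_full)

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 4))
coupled = a @ a.T + 4.0 * np.eye(4)   # positive definite, off-diagonal blocks nonzero

decoupled = coupled.copy()            # same diagonal blocks, coupling removed
decoupled[:2, 2:] = 0.0
decoupled[2:, :2] = 0.0

print("gap with coupled parameters:  ", laplace_log_evidence_gap(coupled, 2))
print("gap with decoupled parameters:", laplace_log_evidence_gap(decoupled, 2))
```

The gap is strictly positive when the off-diagonal blocks are nonzero and vanishes (up to rounding) in the block-diagonal case, matching the sign of the log marginal likelihood difference derived above.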

S2.2. Relation to Invariance of SEMs 

The proposed bootstrap-based method provides a way to examine whether an equation is structural. Suppose $Y = f(X, E)$, where $E \perp\!\!\!\perp X$, is a structural causal model in that $f$ is invariant to changes in the distribution of $X$ [18]. One can then see that, since $E$ and $X$ are independent processes, the bootstrapped $p^{*(b)}(X)$ is independent of the underlying $p^{*(b)}(E)$, and hence independent of $p^{*(b)}(Y \mid X) = p^{*(b)}(E) / \left| \partial f / \partial E \right|$.
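As a worked special case of this change-of-variables relation, consider an additive-noise equation. The functions $g$ and $p_E$ below are hypothetical symbols introduced only for this illustration, and the additive form is an assumption rather than a model imposed in this section.

```latex
\begin{align*}
Y = f(X, E) = g(X) + E
  \quad &\Longrightarrow \quad \left| \frac{\partial f}{\partial E} \right| = 1, \\
p(Y \mid X) = \frac{p_E(E)}{\left| \partial f / \partial E \right|}
  \quad &\Longrightarrow \quad p(Y = y \mid X = x) = p_E\bigl( y - g(x) \bigr).
\end{align*}
```

In this case the conditional density is determined by $g$ and $p_E$ alone and does not involve $p(X)$, so resampling the input changes the estimate of $p(X)$ without systematically changing the mechanism.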

Now consider the other direction. According to [?], we can always find an equation $X = f(Y, E)$ such that $E \perp\!\!\!\perp Y$; suppose this equation is not structural, in that $f$, or in particular $\left| \partial f / \partial E \right|$, is dependent on $p(Y)$. Again, we have $p^{*(b)}(X \mid Y) = p^{*(b)}(E) / \left| \partial f / \partial E \right|$. The bootstrapped $p^{*(b)}(Y)$ and $p^{*(b)}(X \mid Y)$ are then dependent due to the dependence between $\left| \partial f / \partial E \right|$ and $p^{*(b)}(Y)$.

In particular, the SEM-based causal inference approaches [21], [27], [15] constrain the functions $f$ to be simple in their respective senses; consequently they are not so flexible as to change with the input distribution $p(X)$, and then the independence between the input $X$ and the noise $E$ serves as a surrogate to achieve the exogeneity condition of $X$ for the parameters in $p(Y \mid X)$.

Compared to SEM-based approaches, the proposed exogeneity-based approach avoids the constraints on the functional causal model $f$. On the other hand, some SEM-based approaches have clear identifiability conditions under which the reverse direction $Y \to X$ that induces the same joint distribution on $(X, Y)$ does not exist in general, given the causal direction $X \to Y$; for instance, see [?]. However, to find theoretical identifiability results for the proposed approach, one has to establish the identifiability conditions in terms of data distributions, which turns out to be extremely difficult.


